📊 Performance Test Results — comparing 4fa17d4 vs trunk across the app-size, site-editor, and site-startup benchmarks.

Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)
Will update this after #3360 lands.
Summary
- Record the final SDK result shape on eval artifacts: `resultSubtype`, `resultIsError`, `resultStopReason`, `resultText`, and `resultErrors`.
- Capture an opt-in turn transcript behind `STUDIO_EVAL_INCLUDE_TRANSCRIPT=1`, with text/tool-result truncation to avoid bloating default artifacts.
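For illustration only, a minimal sketch of what these additions could look like. The five field names and the environment variable come from the summary above; the surrounding TypeScript shape and variable names are assumptions, not the PR's actual code.

```ts
// Hypothetical artifact shape: the scalar result fields listed above.
interface EvalResultFields {
  resultSubtype: string;     // e.g. "success"
  resultIsError: boolean;
  resultStopReason: string;  // e.g. "end_turn"
  resultText: string;
  resultErrors: string[];
}

// Transcript capture is opt-in via an environment variable.
const includeTranscript = process.env.STUDIO_EVAL_INCLUDE_TRANSCRIPT === '1';
```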
Why

This continues the eval-runner observability work from #3273 and #3330. Those PRs made phase/tool timings, first tool errors, loop exceptions, and timeouts visible. This PR adds the final SDK result shape and an opt-in turn transcript so eval consumers can distinguish model behavior, PI harness continuation behavior, runner classification, and downstream benchmark quality gates.
The need surfaced while testing the Static Site Importer draft path in #3309 with the Studio site-build benchmark. GPT-5.5 repeatedly produced tool-only runs for the built-in prompt variants (`restaurant`, `wordpress-is-dead`): `site_list`/`site_info` returned successfully, then the SDK emitted `subtype: "success"`, `stopReason: "end_turn"`, and an empty `result`. No assistant text, `Write`, `wp_cli`, or import report was produced, but the eval runner classified the run as successful because it trusted `message.subtype === 'success'`.

With the local transcript diagnostics enabled, that failure shape was clear. The same diagnostics also helped compare Claude Sonnet 4.6 on the same SSI site-build flow: Claude generated source HTML and wrote files, then timed out while repairing generated helper-script errors before reaching import. Different failure mode, same need for better eval evidence.
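For context, a hypothetical sketch of the kind of consumer-side check this evidence enables. It reuses the assumed `EvalResultFields` shape from the Summary sketch above and is not code from this PR or the eval runner.

```ts
// Hypothetical consumer-side check (not this PR's code): with the final
// result shape recorded, a tool-only "success" with no output can be flagged
// rather than trusted on message.subtype alone.
function looksLikeEmptySuccess(r: EvalResultFields): boolean {
  return (
    r.resultSubtype === 'success' &&
    !r.resultIsError &&
    r.resultStopReason === 'end_turn' &&
    r.resultText.trim().length === 0
  );
}
```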
This PR keeps the transcript opt-in so normal eval artifacts only gain a few scalar fields, while deeper debugging remains available when investigating model/runtime/harness regressions.
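Also assumed rather than taken from the diff, a tiny sketch of the text/tool-result truncation idea that keeps default artifacts small; the helper name and character limit are illustrative.

```ts
// Hypothetical truncation helper: cap long text or tool-result payloads so
// default artifacts stay small. The 2,000-character limit is an assumption.
function truncateForArtifact(text: string, maxChars = 2000): string {
  return text.length <= maxChars
    ? text
    : `${text.slice(0, maxChars)}… [truncated]`;
}
```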
Validation
- `npm install` to bootstrap the clean worktree.
- `npm run cli:build --silent` — passed.
- `npx eslint apps/cli/ai/eval-runner.ts` — passed.
- `npm -w wp-studio run typecheck` — passed.
- `git diff --check` — passed.

AI assistance